Skip to Main Content
University of York Library
Library Subject Guides

Essential Spreadsheets: a Practical Guide

T4: Processing

Spreadsheets were developed as a tool to store, analyse and manipulate numerical data. They are now commonly used for working with sets of data containing both text and numbers, and for generating graphs and charts. They're also brilliant. Let's play with

Cross-referencing and interrogating information is a very database-y activity, but it's something spreadsheets can do too, to some extent. On this page we take a look at lookup functions, pivot to the topic of pivot tables, then query Google Sheets' QUERY function and Excel's Query Editor. We also look at some other functions for reorganising data.

Lookup functions

Lookup functions provide a means to reference data stored as a table elsewhere in a spreadsheet, and so extract values from it for use elsewhere.

There are four main lookup functions:

LOOKUPLocates a supplied value in an ordered column and returns a value from the same row in another column
VLOOKUPLocates a supplied value (or its inferred position in a range of values) in the first column of a range of data and returns the value from the same row in any other specified column — very widely used as the lookup function of choice since 1985 but now superseded by XLOOKUP
HLOOKUPWorks the same as VLOOKUP but with data that is arranged in rows rather than columns
XLOOKUPIntroduced in 2019 as an improved version of VLOOKUP and HLOOKUP — it has the immediate simplicity of the LOOKUP function, can work both vertically and horizontally, and can search in various ways without the need for the data to be in a particular order

In addition to these functions, there's the old power-user favourites of MATCH and INDEX which were often used as an alternative to VLOOKUP before XLOOKUP's arrival on the scene! We'll take a look at those too.

Because of their ubiquity we'll look at all of these methods in turn — there's a very real chance that you'll come across them in other people's spreadsheets. But for future work the XLOOKUP is unquestionably the method to favour.

LOOKUP

=LOOKUP(lookup-value,lookup-range,result-range)

LOOKUP is a weird one. Superficially it looks like the perfect lookup function: a supplied lookup-value (e.g. ‘Muffins’) is looked up in one column, and the value is returned from the same row of another column. Here it is in action:

fx=LOOKUP(F2,B:B,D:D)
ABCDEFG
1 IDItemPortion (g)CaloriesLookupReturn
2 1001Apple pie200180Muffins120
3 1002Chocolate cake150200
4 1003Flapjack160210
5 1004Lemon drizzle170200
6 1005Muffins90120
7 1006Sausage rolls85130

But there's a catch. And it's a big one. LOOKUP only works if the lookup range is sorted in ascending order. And if it can't find what you're after it will make a best guess based on where in your list the lookup value should come. If "Muffins" were to appear anywhere else in our list other than in its proper alphabetical place between the lemon drizzle and the sausage rolls, we'd get the wrong result being returned. And that's not good at all.

For that reason, LOOKUP is only really provided in spreadsheets as a legacy function and pretty much everything about it is replicated in a more straightforward way in the XLOOKUP.

VLOOKUP (and HLOOKUP)

VLOOKUP is one of the most commonly used functions in spreadsheets over the last half a century, and so while it may have been superseded by XLOOKUP we're going to need to take a look at it. Let's see what arguments go into the function:

=VLOOKUP(lookup‑value, lookup‑array, result‑index, sorted?)

lookup‑valueThe value you're looking for
lookup‑arrayA data table, the first column of which will be matched to the lookup-value
result‑indexAwkwardly, the number of the column in your lookup-array from which you will retrieve your result, where the first column in the array is 1
sorted?Do you want this lookup to work like LOOKUP, whereby the first column of your lookup-array is sorted in ascending order and 'guesses' can be made as to the value to return (in which case put TRUE here), or are you looking for an exact match in a not-necessarily-sorted range (in which case put FALSE)?

The function's fiddlier than it needs to be, so let's see a (familiar) example:

fx=VLOOKUP(F2,B:D,3,FALSE)
ABCDEFG
1 IDItemPortion (g)CaloriesLookupReturn
2 1001Apple pie200180Muffins120
3 1002Chocolate cake150200
4 1003Flapjack160210
5 1004Lemon drizzle170200
6 1005Muffins90120
7 1006Sausage rolls85130

Some things to notice / be aware of:

  • Our list of data is in columns A:D but we're using B:D as our lookup-array because VLOOKUP needs the matching to happen in the first column of that lookup-array.
  • Column D is the fourth column in the sheet, and the fourth column in our data table, but it's the third column in our lookup-array: B:D, hence the 3 in our formula.
  • Our data in B:B happens to be alphabetical, but we're still looking for "Muffins" so we still want an exact match from our lookup. We should therefore use FALSE in the last argument. If we use TRUE (or miss out the last argument entirely) and the "Muffins" entry wasn't in B:B, we'd get back a misleading 200 from the lemon drizzle instead (more on why in a moment).
  • Because we're using FALSE, the values in column B could be in any order: they don't have to be alphabetical. But if there's more than one entry for "Muffins", only the correspondence from the first entry in the list would be returned.

Range lookups

Let's get to the bottom of this TRUE/FALSE business...

In the example above, the fourth argument was set to FALSE. This means that a calorie value is only returned if "Muffins" is found in column B. Otherwise the formula will return a #N/A error.

So why would we ever want anything else?

Suppose we're marking an exam:

fx=VLOOKUP(C2,F:G,2,TRUE)
ABCDEFG
1 IDExamResultGradeMarkGrade
2 1001168C0Fail
3 1001272B40E
4 1002156D50D
5 1002263C60C
6 1003181A70B
7 1003276B80A
8 1004142E

Our lookup array in F:G gives a set of grade boundaries: "Fail" starts at 0, "E" starts at 40, etc. A table such as this allows us to do a range lookup: rather than looking for an exact match, we want the closest range value below the lookup value supplied. So we put TRUE for that fourth arugment in the VLOOKUP. Now, when we look up the mark of 68 from cell C2, our closest match below 68 is the 60 in row 5 which corresponds to the grade "C".

Range lookups are generally used with numerical data where an exact match is less likely to be a thing. Of course, as with the LOOKUP example we saw earlier, our lookup range needs to be sorted in ascending order for the range lookup to work properly.

Tip

When creating a range lookup array, you must include the lowest possible value – in the above example, for instance, a zero is needed, otherwise values below 40 would generate an error.

HLOOKUP

HLOOKUP works in exactly the same way as VLOOKUP but horizontally rather than vertically: it looks across columns for a match and then down rows for a corresponding value.

XLOOKUP

Probably the best lookup function in Google Sheets and Excel (2019 or after, and Microsoft 365) is XLOOKUP...

=XLOOKUP(lookup-value, lookup-range, result-range, [if-not-found], [match-mode], [search-mode])

Fundamentally, this works like LOOKUP, but defaults to the "exact match" method you'd see when using FALSE in a VLOOKUP. However, the three extra arguments to the function let you do some serious fine-tuning of how the lookup works:

if‑not‑foundYou can set the value that's returned if there's no matching value. By default this is #N/A but you could set it to be something else here without the need for an IFERROR function.
match‑modeA more advanced version of the "sorted?" option in VLOOKUP — in VLOOKUP wildcards are automatically enabled, and inexact matching relies on the sort order of the dataset being interrogated, whereas XLOOKUP doesn't need the data to be sorted and the matching can be configured more precisely:
  •  0 - Exact match (default)
  • -1 - Exact match; if not found, return the next smallest item
  •  1 - Exact match; if not found, return the next largest item
  •  2 - A wildcard match using ? or * to stand for one or many characters
search‑modeHow the search is performed. Because XLOOKUP doesn't need the data to be sorted it has to perform a 'mental' sort as part of its calculations. That's not usually a problem, but for larger datasets it could slow things down. If you've got a lot of data to search, you can potentially use one of these options to help reduce the calculation time:
  •  1 - Start at the top/left (default)
  • -1 - Start at the bottom/right
  •  2 - Binary search of a list sorted in ascending order
  • -2 - Binary search of a list sorted in descending order.

XLOOKUP can work either vertically or horizontally, depending on the ranges selected. The return range can also be wider than a single column or row to return matches from multiple fields in your dataset.

MATCH and INDEX

MATCH and INDEX can be used together for lookup purposes, and were often favoured over VLOOKUP because they didn't rely on hard-coded numbers and column counting. This approach has been superseded by XLOOKUP but you might still come across it so let's take a look:

=MATCH(lookup‑value,lookup‑range,type)Identifies the position of a lookup-value in a lookup-range, based on the type of lookup specified:
 0 - exact match
 1 - range match
-1 - inverted range match (matches upwards rather than downwards)
=INDEX(range,row,[column])Returns a value from a range at the row (and/or column) specified

Let's go back to our list of desserts:

fx
ABCDEF
1 IDItemLookupMATCHINDEX
2 1001Apple pieMuffins61005
3 1002Chocolate cake=MATCH(D2,B:B,0)=INDEX(A:A,E2)
4 1003Flapjack
5 1004Lemon drizzle51005
6 1005Muffins=MATCH(D2,B2:B7,0)=INDEX(A2:A7,E2)
7 1006Sausage rolls

In E2 we're using MATCH to look for the value "Muffins" (from D2) in B:B, while requiring an exact match (0). The result is 6. We're then passing that result to the formula in F2 where we're looking in A:A for the sixth row.

In row 5 we've done the same thing but expressing the ranges from rows 2:7 rather than using whole columns. This gives a different value for the MATCH because the returned row number is the row in the range, not the absolute row number on the sheet. This is also true of INDEX which is why we still get the right answer in F5.

Of course, the MATCH and INDEX functions can be combined, with the MATCH nested inside the INDEX. For example:

=INDEX(A:A,MATCH(D2,B:B,0))

Pivot tables

Pivot tables allow you to rearrange a dataset so as to be able to view it from different perspectives.

In order to do this, the data must be organised in a pivotable way; you cannot create a pivot table from poorly-organised data.

Pivotable data

Let's consider an example:

Students have volunteered to take part in activities to raise funds for charity, and we want to analyse the proceeds to see which activities were most successful, which colleges raised the most, and so on.

We may be tempted to lay our data out like this:

fx
ABCDEFG
1 StudentYearCollegeBean bath10k runParachute jumpTandem joust
2 David Jones2Derwith60.0075.50 55.00
3 Farrokh Bulsara1Alcricke70.0085.0045.50
4 Catherine Bush2Langbrugh 65.5095.5035.00

It's a nice, useable layout: each student with their own row; columns for each activity; values entered for the amounts raised in each case...

The trouble is, that data's already been pivoted!

What does that actually mean? Well, let's say we wanted to tot up the amount raised by each college or yeargroup. It's going to take an awful lot of messing about because our amounts raised are in multiple columns. Really we want to have the amounts raised in a single "Amount" column. And that means we'll also need a corresponding "Activity" column. And that means we need to do something to our data that might seem counter-intuitive — we need to do this:

fx
ABCDE
1 StudentYearCollegeActivityAmount
2 David Jones2DerwithBean bath60.00
3 David Jones2Derwith10k run75.50
4 David Jones2DerwithTandem joust55.00
5 Farrokh Bulsara1AlcrickeBean bath70.00
6 Farrokh Bulsara1AlcrickeParachute jump45.00
7 Farrokh Bulsara1AlcrickeTandem joust85.00
8 Catherine Bush2Langbrugh10k run65.50
9 Catherine Bush2LangbrughParachute jump95.50
9 Catherine Bush2LangbrughTandem joust35.00

This may look odd, particularly the repetition, but it means that each row is a collection of separate data items including all relevant information for each instance of a student doing an activity. And the stuff we want to be able to measure (specifically the amount raised) is in the one easy-to-handle column.

Data structure rules

The rules we talked about regarding lists are doubly true here:

  • Data should be entered down the page, each item occupying a new row, so that each row contains one instance of the value to be used in calculation (e.g. the amount raised in the above example)
  • Each column should contain just one type of information (e.g. text, number, date), and each cell should contain just one value
  • Enter column headings in one row at the top of the list – never use more, and never merge cells for labels
  • Do not repeat attributes across several columns (e.g. month names) no matter how tempting it might be!
  • Do not leave whole rows or columns empty (some blank cells are fine)
  • Stick to one dataset per tab of a spreadsheet file

Planning the ideal data structure

Here’s one approach that may help you plan or check your data structure. We’ll continue to use the student fund-raising example.

First, identify the data that provides the individual values that are most important to your analysis — in this case, it’s the amount raised at a specific activity undertaken by a specific student.

Next, surround this with the collection of related attributes:

Also include ‘attributes of attributes’ if you are likely to make use of them. For example, college and year are technically attributes of the student, but since we are storing all the data in one table, these must also be regarded as attributes of the amount, and included with each value.

These attributes should then be the additional columns used in the data set, so each amount has a corresponding activity, student, college and year.

Lowest common denominator data

Pivot tables work by grouping like values together. So it will look at our data-set, spot three rows of David Joneses and start merging them. Which is all well and good if we've only got one David Jones in our student cohort. But perhaps we've got two of them (like the one from The Lower Third and the one from The Monkees). If we don't want these two entities mingling together into some frightening chimera, we're going to need to provide some means of disambiguation. Perhaps the year and/or college would be sufficient to do that, or maybe we're going to need a unique student ID.

Unpivoting 'pivoted' data

If you've inherited a dataset that just isn't pivotable, there are ways to get that data into a pivotable structure.

Excel

The easiest way of unpivoting is via Excel's Query Editor (more of which later, but this will serve as a tantalising glimpse ahead):

  1. Assign a named range to your list of data;
  2. Select any cell within your list
  3. Choose Data > Get & Transform Data > From Table/Range —the Query Editor will open;
  4. Select the columns containing the values you need to unpivot (in our example, the columns with the amounts);
  5. Choose Transform > Any Column > Unpivot Columns;
  6. Rename your unpivoted columns (they'll be called Attribute and Value by default) — select a column and go to Transform > Any Column > Rename (or just double-click the header);
  7. Hit Home > Close & Load — your restructured data will be written to a new list on a new sheet;
  8. This list will be in the form of an Excel Table: a special kind of list with its own set of conventions. If you'd prefer it to be a normal range, you can go to Table Design > Tools > Convert to Range. This will sever the link between this table and your original data, so if your original data is likely to change you might not want to do that.
Google Sheets

Unpivoting in Google Sheets is a bit harder. There are a few methods you could use, all of which essentially involve building up a coordinate map of your data and then restructuring it in a new location, including this example using OFFSET, ROUND, MOD, COLUMN and ROW.

There's also the "Monkey Tennis" approach by player0 and Ben Collins which unpivots a sheet using this beautiful formula:

=ARRAYFORMULA( IFERROR( SPLIT( TRIM( TRANSPOSE( SPLIT( TRANSPOSE( QUERY( TRANSPOSE( QUERY( TRANSPOSE( IF( Sheet1!B2:Z<>"", Sheet1!A2:A & "🐒" & Sheet1!B1:1 & "🐒" & Sheet1!B2:Z & "🎾", )),, 500000)),, 500000)), "🎾"))), "🐒"), ""))

And with the arrival of the FLATTEN function there's now an even more elegant way to unpivot in Google Sheets:

=ARRAYFORMULA( QUERY( SPLIT( FLATTEN( Sheet1!A2:A & "🥞" & Sheet1!B1:1 & "🥞" & Sheet1!B2:Z), "🥞"), "Select * where Col2 is not null"))

  1. Make a copy of your list on a new sheet and reduce it to an index column and the columns you need to pivot (in our case we'd have the students column as our index and the activities columns for unpivoting. This effectively gives us a header row, a header column, and our values;
  2. Use the "FLATTENed pancake" formula (or the "monkey tennis" formula if you prefer) on a new sheet, making sure that it points to the sheet with your data on (the formula assumes it's called Sheet1). If you have more than 26 columns, you'll need to modify the B2:Z bit of the formula;
  3. The formula basically reduces your list to a coordinate set (using some emoji delimiters) and then breaks it apart in its new form. You'll need to add new headers for your resulting unpivoted data;
  4. In our case we need to add back the student attributes of year and college which we could do with a VLOOKUP (or similar formula) back to our original data.

Pivot table anatomy

A pivot table can consist of the following elements:

  • A field that contains the values to be used in calculation
  • A field to be used as labels down the left of your table (row labels)
  • A field to be used as labels across the top of your table (column labels)
  • A field to be used for filtering your data.

The only real essential there is the values, though values without any breakdown is just a sum total. So you'll probably want to attach some of your attributes to the values so that you can compare them. You might therefore want to consider which attributes you want to compare: one of these could become the row labels, another the column labels. A filter then allows you to limit the rows, columns and values by another field.

Taking the example we've been playing with so far, our values are the amount raised. We could compare individual students (as rows) against the activities they were involved in (as columns), thereby essentially recreating the layout of that original table (albeit minus the year and college):

fx
ABCDE
1 StudentBean bath10k runParachute jumpTandem joust
2 David Jones60.0075.50 55.00
3 Farrokh Bulsara70.0085.0045.50
4 Catherine Bush 65.5095.5035.00

The year or college could then, potentially be used as a filter for our data. Or maybe even two filters. We could even add more column labels or row labels. That way we really could rebuild that first table if we wanted to:

fx
ABCDEFG
1 StudentYearCollegeBean bath10k runParachute jumpTandem joust
2 David Jones2Derwith60.0075.50 55.00
3 Farrokh Bulsara1Alcricke70.0085.0045.50
4 Catherine Bush2Langbrugh 65.5095.5035.00

Here's the thing about pivot tables – the thing that makes them brilliant: it doesn't matter what you put where because it's like Lego — you can just keep trying different things in different places and see what works!

For instance:

Filter

Could be year,college, or activity;
Less likely to be student.
 
 
 

Column labels

Could be student, year, college, or activity.

Row labels

Could be student, year, college, or activity.
 

Values

Will pretty-much almost always be amount.

Creating pivot tables

Having first ensured that your data is pivotable, and that you have an idea as to what numerical values you're going to be working with, it's time to do some pivoting!

Excel
  1. Select any cell in your dataset and choose Insert > Tables > PivotTable — the data range is indicated and a dialogue opens with the range already entered.
  2. Tip

    There's also a Recommended Pivot Tables option in the same place that may give you an idea of what’s possible with your data.

  3. You will probably want your pivot table on a new sheet, but you can opt to put it on an existing sheet (in which case you'll need to define the position of the top-left cell of the area to be used).
  4. Tip

    If you're specifying a location, don't try to put your table any higher than row 3 of the sheet, because you need to make room for filter controls.

  5. Choose OK and the empty framework for your pivot will be created.
  6. Using the panel that appears on the right, drag fields from the list into the appropriate areas below. As you do this, the used fields will be ticked. Alternatively you can tick the required fields and have Excel decide where to put them (you can always drag to rearrange if you disagree with it).
  7. The field you choose for the Values will usually (but not always) be numeric. For numeric data you can choose to display a Sum, Average etc., by choosing the drop-down and selecting Value Field Settings…

  8. As you add more items to your pivot table, you might want to fine-tune its appearance. A contextual Design tab will be available on the ribbon menu with various Layout options. These control aspects such as subtotals and repeated labels.

Google Sheets
  1. Select any cell in your dataset and choose Insert > Pivot table — the data range is indicated and a dialogue opens with the range already entered.
  2. You will probably want your pivot table on a new sheet, but you can opt to put it on an existing sheet (in which case you'll need to define the position of the top-left cell of the area to be used).
  3. Choose Create and the empty framework for your pivot will be created.
  4. Using the Pivot table editor pane on the right, add fields in the locations you require using the Add buttons. The panel will also suggest options it thinks you might want to try.
  5. Options for subtotals and repeated row labels appear in the side menu on the field 'cards' you add. You will also be able to sort your data using the sort controls on these cards.

Some example pivots

The following examples just use the simple nine-row list of three students and their fund-raising antics:

Q1: Which activities generated the most money in each year group?

  • Columns: year
  • Rows: activity
  • Values: amount
fx
ABCD
1 Sum of AmountYear
2 Activity12Grand Total
3 10k run141141
4 Bean bath7060130
5 Parachute jump4595.5140.5
6 Tandem joust8590175
7 Grand Total200386.5586.5

Q2: Which activities were most successful in each college?

  • Columns: college
  • Rows: activity
  • Values: amount
fx
ABCDE
1 Sum of AmountCollege
2 ActivityAlcrickeDerwithLangbrughGrand Total
3 10k run75.565.5141
4 Bean bath7060130
5 Parachute jump4595.5140.5
6 Tandem joust855535175
7 Grand Total200190.5196586.5

Q3: How well did students in year 2 do at raising funds? — which activities raised most?

  • Columns: activity
  • Rows: student
  • Values: amount
  • Filter: year 2
fx
ABCDEF
1 Year2
2
3 Sum of AmountActivity
4 Student10k runBean bathParachute jumpTandem joustGrand Total
5 Catherine Bush65.595.535196
6 David Jones75.560 55190.5
7 Grand Total1416095.590386.5

Tip

If we were doing a proper job at this, our student labels would probably take up more than one column as we'd have a unique studentID and we'd probably divide up the forename and surname fields.

Q4: Which activities by year 2 students raised the most?

  • Columns: none
  • Rows: activity
  • Values: amount
  • Filter: year 2
fx
AB
1 Year2
2
3 ActivitySum of Amount
4 10k run141
5 Parachute jump95.5
6 Tandem joust90
7 Bean bath60
8 Grand Total386.5

Tip

We've sorted the values column in descending order. We can do things like that — we're allowed.

Q5: How many activities did each year-group undertake?

  • Columns: year
  • Rows: activity
  • Values: activity summarised by COUNTA
fx
ABCD
1 Sum of AmountYear
2 Activity12Grand Total
3 10k run22
4 Bean bath112
5 Parachute jump112
6 Tandem joust123
7 Grand Total369

Tip

While we'd generally be working with numerical data for our values, we can also do basic counts of textual information.


ExcelPIVOTBY function

Excel for Microsoft 365 has introduced a PIVOTBY function which allows you to build basic pivot tables via a data function. It can take a number of arguments but only the first four are essential:

=PIVOTBY(row-fields, col-fields, values, function, [field-headers], [row-total-depth], [row-sort-order], [col-total-depth], [col-sort-order], [filter-array], [relative-to])

row‑fields The range used to create the rows of the pivot. It can be multiple columns. If the columns aren't adjacent you can combine them with HSTACK (for example, HSTACK(A:A,D:D) will bring through column A and column D) as rows in the pivot. It can also be blank.
col‑fields The range used to create the columns of the pivot. It can work with multiple columns in the same way as row-fields. It can also be blank.
values The range containing the values to be aggregated in the pivot. Again, it can be more than one column.
function The type of aggregation to be performed on the values (there's a wide range of options). You can apply multiple aggregations using HSTACK (for example, HSTACK(SUM, AVERAGE)
[field‑headers] Determines whether headers are brought through into the pivot.
[row‑total‑depth] Used to toggle total and subtotal rows for the columns
[row‑sort‑order] Used to determine the sort order of the rows. Numbers are assigned to the output columns in your pivot. If your pivot has three columns and you wanted to sort the values in the second column, you would use 2, or -2 to sort in reverse order.
[col‑total‑depth] Used to toggle total and subtotal columns for the rows.
[col‑sort‑order] Used to determine the sort order of the columns. Works in the same way as row-sort-order
[filter‑array] A filter range and logical test, for instance B:B>=2
[relative‑to] Used for aggregate functions in the function argument that require a second argument (for instance, PERCENTOF).

PIVOTBY could be used to generate all of the example pivot tables in the previous section:

Q1: Which activities generated the most money in each year group?

=PIVOTBY(D1:D10,B1:B10,E1:E10,SUM)

You could add field-headers (3) but the repetition of "Activity" might be a bit ugly, so we didn't bother.

Q2: Which activities were most successful in each college?

=PIVOTBY(D1:D10,C1:C10,E1:E10,SUM)

It's the same formula as before but using a different range for the columns.

Q3: How well did students in year 2 do at raising funds? — which activities raised most?

=PIVOTBY(A1:A10,D1:D10,E1:E10,SUM,,,,,,B1:B10=2)

There's a lot of empty fields to get past before we can incorporate the filter here. This filter is specifically to 2nd years (2) — you could create a dynamic filter by referencing a cell rather than hard-entering the value.

Q4: Which activities by year 2 students raised the most?

=PIVOTBY(D1:D10,,E1:E10,SUM,3,,-2,,,B1:B10=2)

This example doesn't bother with any col-fields, but we've turned on field-headers (3to make things a bit clearer. We're also sorting by the second column (the Amount) in descending order (hence the -2.

Q5: How many activities did each year-group undertake?

=PIVOTBY(D1:D10,B1:B10,E1:E10,COUNTA)

A simpler example again, but we've changed the aggregate function to COUNTA which counts anything in the range.

QUERY function (Google Sheets)

Google Sheets

The QUERY function is the most powerful data function in Google Sheets. It provides a method to interrogate a data table and generate a sub-set that is actively linked to the source data but will not change it. This makes it particularly useful in a collaborative context, or where you want to produce several sub-sets from the same data source.

=QUERY(data-array,"query-expression",[headers])

Although you can use it on the sheet that contains the data, you are more likely to use this function on another worksheet in the file. The function is entered in just one cell, which becomes the top left cell of the retrieved data.

data‑arrayThis is the range of cells containing the data. Since you will most likely be referring to another worksheet, it needs to include the sheet name if you use row-column notation, for example:
data_sheet!A1:G50

You can also refer to entire columns if you don’t know how many rows there will be:
data_sheet!A:G

But defining this as a named range makes the syntax easier, and allows easy extension of the number of columns used.

Tip

If you define whole columns in a named range, the query may include blank rows — you can stop them appearing by using an appropriate ‘where’ clause.

"query‑expression"Enclosed in quotes, this is the query expression which defines the columns to be used, criteria to be applied, sort orders and grouping.
The expression can refer to values in other cells in the spreadsheet file.
[headers]An optional argument that tells the query how many header rows are at the top of the data-array.

The "query-expression" argument is a string of text which defines the columns to be used, the sort orders to be applied, filtering criteria, and even grouping options and calculated values. The expression is based on Structured Query Language (SQL), used by database systems, but it's not difficult to work with for things like straightforward sorting and filtering.

Here's some simple examples where dataList is a named range:

=QUERY(dataList, "select A,F,B") Simple selection (and reordering) of three columns
=QUERY(dataList, "select A,F,B order by B desc")Select three columns and sort by one of them (in descending order)
=QUERY(dataList, "select A,F,B order by B desc, C asc")Select three columns and sort by two of them (in different orders)
=QUERY(dataList, "select A,F,B where F=3")Select three columns and filter based on a condition

Tip

Columns referenced in the "where" and "order by" parts of the query do not need to be part of the "select"‑ed columns being pulled through, but do need to exist in the referenced data‑array. So in the third example above, the data can be sorted on column C even though only columns A, F, and B are being presented.

The query expression part of the QUERY function has a number of useful commands and we'll go through them in turn here:

Query expression syntax — select and order by

select

select selects the columns you want to display. These columns can be expressed by their letter names or by their position in the data array (expressed as Col1, Col2, etc.), so if our dataList range starts at column A, then...

=QUERY(dataList, "select A,F,B")
…is the same as…
=QUERY(dataList, "select Col1,Col6,Col2")

The selected columns can be in any order. They just need to exist in the original data, so if the named range dataList is mapped to columns A:F then you couldn't refer to a column G or a Col7 in your query expression.

You don't need to pull through the column as is. You can make certain transformations. For instance, you can do basic maths on values in a queried range using the standard mathematical operators (+-*/). Suppose column F of dataList is a mark out of 100 and we wanted to convert it to a percentage, we could do:

=QUERY(dataList, "select A,F/100,B")

We could then format the column as a percentage in the usual way.

Other transformations you can apply to the selected data include aggregate functions and scalar functions (more of which below).

order by

order by controls the way your data is sorted — for example:

=QUERY(dataList, "select A,F,B order by B desc")
…or…
=QUERY(dataList, "select Col1,Col6,Col2 order by Col2 desc, Col3 asc")

asc orders the data in ascending order by the column specified; desc in descending order. The columns controlling the sort do not need to be the columns select‑ed for display but do need to be available in the data array.

Query expression syntax — where

Using "where" for equalities, inequalities, and exact matches

where allows you to filter your data. As with standard functions, numbers and text require slightly different syntax:

  • As usual, numbers don't use quotes — where A=42
  • Text values are enclosed in 'single quotes' (so as not to interfere with the "double quotes" of the query expression) — where A='Donkey'
  • Dates often require their own special syntax but we'll get to that!

Here's a couple of simple examples:

=QUERY(dataList, "select A,F,B where F=3")Filter to show only the rows where the numerical value 3 is in column F of the dataList array
=QUERY(dataList, "select A,B where F='Sold'")Filter to show only the rows where the value in column F of dataList is the text Sold

The usual symbols for equalities and inequalities apply:

=QUERY(dataList, "select A,F,B where F>3")Filter to show only the rows where a value greater than 3 is in column F
=QUERY(dataList, "select Col1,Col6,Col2 where Col6<=4")Filter to show only the rows where a value less than or equal to 4 is in the 7th column of dataList
=QUERY(dataList, "select A,F,B where F<>2")Filter to show only the rows where the numerical value 2 is not in column F

Because the query expression is a string of text in its own right (encased in "double-quotes" in the QUERY formula), it's necessary to 'break out' of the string to reference any cell values in your spreadsheet. The concatenation operator & is used to join the cell reference to the rest of the query expression:

=QUERY(dataList, "select A,F,B where D="&F2)Filter to show only the rows where the value in column D is the numerical value in cell F2
=QUERY(dataList, "select Col1,Col6,Col2 where Col4>="&F2)Filter to show only the rows where the value in the 4th column in dataList is greater than or equal to the numerical value in cell F2
=QUERY(dataList, "select A,B where F='"&E2&"'")Filter to show only the rows where the value in column F is the text found in cell E2

The latter example needs two &s because it needs to add the final 'single quote' to enclose what is in effect a text string within a text string.

Let's assume that cell E2 contains the word Donkey — then…

=QUERY(dataList, "select A,B where F='Donkey'")
…is fundamentally identical to…
=QUERY(dataList, "select A,B where F='"&E2&"'")

Using "where" to match a text fragment

If you want to use only part of a text value in the where clause, use contains instead of = :

=QUERY(dataList, "select A,B where F contains 'br'")

This would filter for all instances of text containing the string br, returning matches for broken, library, and cobra.

Using "where" with dates and other special formats

Working with dates using the where command can be fiddly. It requires a precise form of syntax:

=QUERY(dataList, "select C,B,E,H where H < date '2005-03-26'")

Notice that:

  • The word date must be used as an identifier in the expression
  • The date is a text string (enclosed in 'single quotes')
  • The date must use the format ‘yyyy-mm-dd’ with a hyphen separator

This creates an interesting challenge if you want to reference a cell value in the expression: if you enter the desired date into your cell in the ‘yyyy-mm-dd’ format, Google Sheets is likely to recognise it as a date and convert it to a ‘proper’ date/time value - which won’t work in the query expression.

To be sure of getting a correct ‘string’ value for the date, enter the date in your usual format and then use functions to generate the string. Suppose the reference date is in cell A1 — so long as it's a recognised date in Google Sheets you could use the TEXT function to convert the date to the appropriate text string:

=QUERY(dataList, "select C,B,E,H where H < date '"&TEXT(A1,"yyyy-mm-dd")&"'")

The file, More Query function examples, includes this and some alternative approaches on the where (dates) tab.

Along with date you can also define formats using number, timeofday, datetime (for combined date and time fields), and boolean (for true/false values).

Using "where" to filter out blank rows

If your data array includes blank rows (which is likely if you've referred to whole columns), you may find that these blanks appear in the query result (which, depending on the sort order, could be at the top!). To stop these appearing you could add a clause to exclude them. If you had blanks in column A then the following would work:

=QUERY(dataList, "select A,F,B where A<>''"Filter to exclude rows where column A has a blank text field
=QUERY(dataList, "select A,F,B where A<>0"Filter to exclude rows where column A has a blank numeric field
=QUERY(dataList, "select A,F,B where A is not null"Filter to exclude rows where column A has any kind of blank field

Combining "where" clauses

You can use where to filter on multiple clauses at once using the Boolean operators and and or:

andWhen and is used, only rows in which both conditions are true are included
orWhen or is used, rows are returned where either condition is true

Here's an example without cell references:

=QUERY(dataList, "select C,B,E,H where E = 'Derwith' and D=2")

...and here's an equivalent one using cell references:

=QUERY(dataList, "select C,B,E,H where E = '"&G2&"' and D="&H2)

Query expression syntax — group by and pivot

group by

When grouping of records is appropriate, aggregate functions can be included in order to calculate averages, totals etc. In this case you must define the columns to be presented, the function to be used and on which column(s) to group the records:

=QUERY(dataList, "select E, avg(F) group by E")

The above example displays values from column E and the average of grouped values in column F, grouping records together that contain the same value in column E.

The available aggregate functions are:

avgThe mean average of the values in the column specified
countThe number of numerical items in the column specified
maxThe highest of the values in the column specified
minThe lowest of the values in the column specified
sumThe sum total of the values in the column specified

Essentially the aggregate function component — avg(F) in this case — is treated as one of the columns in the select part of the query string, with the group by section telling it what to do. You could just have the aggregate function component on its own to return a grand total, for instance...

=QUERY(dataList, "select avg(F)")

...but as soon as you add any other columns into the mix you would need to add the group by component.

pivot

You can use pivot to essentially transpose a grouped aggregate function. It works much like the earlier group by example but we don't need to call the grouping column in the select part:

=QUERY(dataList, "select avg(F) pivot E")

This will create a table with column headers created from the values in column E, and corresponding averages of the values from column F in the next row.

Combining the two

By combining group by and pivot we can build an aggregated cross-reference table:

=QUERY(dataList, "select D, avg(F) group by D pivot E")


Excel

The same results can be achieved in Excel for Microsoft 365 using the PIVOTBY function:

=PIVOTBY(D:D,E:E,F:F,AVERAGE,,0,,0,,A:A<>"")

Creating pivot tables with QUERY

We could use this group and pivot approach to build pivot tables with QUERY. The only thing missing would be the total columns which we would have to add manually. Our five pivot examples from earlier could be rendered with the following formulae:

Q1: Which activities generated the most money in each year group?=QUERY(A1:E10, "select D, sum(E) group by D pivot B")
Q2: Which activities were most successful in each college?=QUERY(A1:E10, "select D, sum(E) group by D pivot C")
Q3: How well did students in year 2 do at raising funds? — which activities raised most?=QUERY(A1:E10, "select A, sum(E) where B=2 group by A pivot D")
Q4: Which activities by year 2 students raised the most?=QUERY(A1:E10, "select D, sum(E) where B=2 group by D order by sum(E) desc")
Q5: How many activities did each year-group undertake?=QUERY(A1:E10, "select D, count(E) group by D pivot B")

Query expression syntax — limit and offset

limit

Most of the time you'll want to return all of your data in a QUERY, but sometimes you might just be after a certain number of rows, and that's where limit comes in handy. It lets you set a fixed number of rows to return. For instance...

=QUERY(dataList, "select A,F,B limit 10")

...will show just 10 rows of data.

offset

offset removes a given number of rows from the top of your data. This is acting on the returned data, not the original dataset, so if you've used order by to sort your data high to low, offset 1 will remove the highest-ranked item. Because of this, you shouldn't use offset to remove header rows — you can use the headers argument of the QUERY function for that.

offset can be useful for returning a specific record in your data; for instance...

=QUERY(dataList, "select A,F,B order by B desc limit 1 offset 9")

...would return only the 10th record in the sorted range (it's offset 9 because the first record would have an offset of 0).

Query expression syntax — scalar functions

As well as the aggregate functions mentioned in the group by section, there are scalar functions we can apply to ranges. upper() and lower() change the case of your data (UPPER CASE and lower case), and the rest work with dates and times:

year(), month(), day(), hour(), minute(), second(), millisecond(), quarter(), dayOfWeek(), and toDate()

There's also dateDiff() which you can use in calculations with two values, and now() which allows you to do calculations with the current date and time. These functions all essentially work like their traditional spreadsheet namesakes.

Here's a couple of examples which assume a date-of-birth in column H of dataList:

=QUERY(dataList, "select A,C,B, dateDiff(now(),H)")

…gets you the number of days elapsed since the birthdate, while…

=QUERY(dataList, "select A,C,B, year(now())-year(H)")

…gets you the number of years elapsed.

Headers

The optional headers argument in the QUERY function tells the QUERY how many rows of headers your data-array has. If this argument is omitted, the QUERY will guess the appropriate value based on the appearance of the data. Otherwise the options are:

0The data has no header rows
1The data has one header row which will be applied as the labels in the first row of the output
2
(etc.)
The data has two header rows which will be merged to form the labels in the first row of the output

If you've performed transformations to your data such as applying aggregate or scalar functions, these transformations will be used as the resulting header label, even if the header argument has been set as 0.

label

Within the query expression you can use label to set custom headers. Given the situation just outlined, this is especially useful when performing transformations. Here's an example that creates new labels for each of the outputted columns (though we could always omit labeling anything we were already happy with):

=QUERY(dataList, "select A, C, B, year(now())-year(H) label A 'Student ID', B 'Forename', C 'Surname', year(now())-year(H) 'Age'")

QUERY examples

These two Google Sheets files include some examples of using the QUERY function:

In both these examples the dataList worksheet includes module results for a number of (fictitious) students. As most students have taken more than one module, they appear several times. There is also a sheet named otherData which is used to populate drop-down lists etc.

In each of the example sheets, cells shaded in green (usually on the third row) contain the query function. Some comments are included to help you understand the way the functions are used.

Feel free to make copies of these to experiment with (File > Make a copy).


There's more QUERY examples on Google's Query Language Reference:

Query Editor (Excel)

Excel

Excel doesn't have the QUERY function but it does have a powerful Query Editor.

We've already seen how the Query Editor can be used to 'unpivot' pivoted data, but it has other uses too, not least in terms of cleaning up data (it's great at identifying dates for instance, even when they're misformatted) and reorganising it (in similar ways to QUERY).

  1. Assign a named range to your list of data;
  2. Select any cell within your list
  3. Choose Data > Get & Transform Data > From Table/Range —the Query Editor will open:
  4. Columns can be sorted and filtered from the dropdown toggles, reordered by dragging, and deleted from the menu ribbon, where there are loads of other tools you can apply to your data. Each change you make is recorded in the applied steps panel to the right of the editor, where you can delete steps or reorder them.
  5. Tip

    In the example above, an index column has been added to disambiguate between students, students' names have been split on space characters, the activities have been 'unpivoted', and the amounts have been reformatted as currency.

  6. When you're happy with your changes, hit Home > Close & Load — your restructured data will be written to a new list on a new sheet.

The outputted list will be in the form of an Excel Table: a special kind of list with its own set of conventions. A link is maintained between your original list and the new table, which means that changes to your original data can be passed across to your query output. This connection is not live by default: to refresh the table manually, select a cell within it and go to Query > Load > Refresh (or any of the other Refresh buttons about the place). You can automate the refresh in Query > Edit > Properties, where there are options to refresh on opening, and refresh every n minutes.

If you'd prefer your outputted table to be a normal range, you can go to Table Design > Tools > Convert to Range. This will sever the link between this table and your original data, so if your original data is likely to change you might not want to do that.

Other data functions

In the last few years, spreadsheets have introduce a whole range of new functions for restructuring and reorganising data. We've already encountered a few of them. But here's some more:

Inter-dimensional travel

There's now a group of functions that can convert between one-dimensional arrays (like A1:A10 or A1:E1) and two-dimensional arrays (like A1:E10).

Collapsing data: 2D → 1D

A two-dimensional array can be reduced to a longer one-dimensional array. TOROW takes a two-dimensional range (like A1:E10) and turns it into one long row, while TOCOL does the same thing but as one long column. Both have options for setting the reading direction (across rows or down columns) and for determining whether to skip empty cells.

Wrapping data: 1D → 2D

A one-dimensional array such as a column or row can be restructured into a two-dimensional table. WRAPROWS takes a one-dimensional range and splits it across multiple rows, starting a new row after a defined number of columns. WRAPCOLS unsurprisingly does the same thing but slitting across multiple columns after a defined number of rows.


These two sets of functions are effectively opposites. If we had a 5×10 range, A1:E10, we could do =TOCOL(A1:E10) to reduce the data to a single column of fifty rows. Let's assume we did that function in cell G1 — that would give us a collapsed list of values in cells G1:G50 — then we could do =WRAPROWS(G1:G50, 5) to wrap the collapsed data across five columns, thereby reconstituting our original table!

Merging data

Being able to work with multiple ranges of data has its uses. Here's some more ways we can dynamically combine existing data…

{Constructed arrays}

Google Sheets

In Google Sheets it's always been possible to merge data by building an array in those curly brace thingummies { }, with commas (,) to divide columns and semicolons (;) to divide rows. For instance…

={"College", "Date founded"; "Derwent", 1965; "Langwith", 1965; "Alcuin", 1967; "Vanbrugh", 1967; "Goodricke", 1968; "Wentworth", 1972}

…would give you a 2×7 table of college founding dates. And any of those values in that constructed array could be cell references, or even ranges so long as the ranges were of a consistent width or height.

But now there are two functions that let you do the same thing. And the good news is that these are also available in Excel for Microsoft 365...

HSTACK and VSTACK

HSTACK takes a list of values or ranges and puts them next to each other across multiple columns. For instance, =HSTACK(A1:A5, A6:A10) would create a new table that moved rows 6:10 of the original table and shoved them to the right of a version of rows 1:5.

VSTACK does the same thing but vertically down rows. So, =VSTACK(A1:E5, G1:K5) would shunt the table at G1:K5 beneath a table made from A1:E5.

You could use these functions to replicate the above college array that uses strings and numbers rather than cell references:

=VSTACK(HSTACK("College", "Date founded"), HSTACK("Derwent", 1965), HSTACK("Langwith", 1965), HSTACK("Alcuin", 1967), HSTACK("Vanbrugh", 1967), HSTACK("Goodricke", 1968), HSTACK("Wentworth", 1972))

VSTACK is also useful for merging tables from multiple sheets:

=VSTACK(Sheet1!A2:Z100, Sheet2!A2:Z100, Sheet3!A2:Z100)

…would bring together the data (but not the headers) from three separate 26×100 ranges. You could wrap each of those referenced ranges in a FILTER function if you needed to leave behind any empty rows.

Exercises
Google Sheets

Google Sheets versions of the files are listed below. In each case you will be prompted to make your own copy of the spreadsheet files, for which you'll need to be signed into a Google account.




Excel

You can download the above Google Sheets for use in Excel by going to File > Download

Excel versions of the exercise files can also be found on university-managed machines at T:\IT Training\Essential Spreadsheets

You can map to that drive at home if you use the University's VPN.

Feedback
X